Adaptive  Multi-modality  Sensor  Scheduling  for 
Detection  and  Tracking  of  Smart  Targets 

Chris Kreucher, Doron Blatt, and Alfred Hero
The University of Michigan
Dept. of Electrical Eng. and Computer Science
Ann Arbor, MI 48109-2122
{ckreuche, dblatt, hero}@umich.edu

Keith Kastella
General Dynamics
Advanced Information Systems


Abstract: This paper considers the problem of sensor scheduling
for the purposes of detection and tracking of “smart” targets.
Smart  targets  are  targets  that  are  able  to  detect  when  they  are 
under  surveillance  and  react  in  a  manner  that  makes  future 
surveillance  more  difficult.  We  take  a  reinforcement  learning 
approach  to  adaptively  schedule  a  multi-modality  sensor  so  as  to 
most  quickly  and  effectively  detect  the  presence  of  smart  targets 
and  track  them  as  they  travel  through  a  surveillance  region.  An 
optimal  scheduling  strategy,  which  would  simultaneously  address 
the  issue  of  target  detection  and  tracking,  is  very  challenging 
computationally.  To  avoid  this  difficulty,  we  advocate  a  two  stage 
approach  where  targets  are  first  detected  and  then  handed  off 
to  the  tracking  algorithm. 

I.  Introduction 

The  problem  of  sensor  scheduling  is  to  determine  the  best 
way  to  task  a  sensor  or  group  of  sensors  when  each  sensor 
may  have  many  modes  and  search  patterns.  Tasking  a  sensor 
may  include  such  choices  as  where  to  point,  what  mode  to 
use,  and  what  signal  to  transmit.  In  general,  sensors  must 
balance  complex  tradeoffs  between  competing  mission  goals, 
e.g.  detection  of  new  targets,  tracking  of  existing  targets,  and 
identification  of  existing  targets. 

An  optimal  sensor  scheduling  algorithm  will  depend  on  the 
posterior  distribution  of  the  system  state  conditioned  on  sensor 
measurements.  In  our  application,  the  system  state  describes 
probabilistically  both  the  uncertainty  in  number  of  targets  and 
locations  of  the  individual  targets.  In  principle,  one  could 
derive  an  optimal  scheduling  algorithm  that  simultaneously 
treats  detection  of  new  targets  and  tracking  of  existing  targets 
by  defining  an  appropriate  global  reward.  However,  in  practice, 
this  is  very  difficult  due  to  computational  considerations.  To 
combat  these  computational  challenges,  we  take  a  modular 
approach  and  treat  the  problem  in  two  stages  -  target  detection 
followed  by  target  tracking.  This  suboptimal  algorithm  can  be 
viewed  as  an  approximation  to  an  optimal  algorithm  which 
simultaneously  considers  detection  and  tracking. 

Sensor  scheduling  is  complicated  substantially  when  targets 
under  surveillance  are  able  to  detect  and  respond  to  sensing 
activities  (so  called  “smart”  targets).  In  this  paper,  we  consider 
one  such  scenario.  Specifically,  we  investigate  the  situation 
where  a  sensor  is  charged  with  detecting  and  tracking  a  group 
of  moving  ground  targets  and  the  targets  have  the  ability 
to detect some of the surveillance actions and respond by
concealing  their  whereabouts. 

The  paper  proceeds  as  follows.  In  Section  II,  we  outline 
the  mathematics  and  strategy  of  our  two  stage  detection  and 
tracking algorithm. We first give an overview of reinforcement
learning methods, and then describe the application of
reinforcement  learning  to  the  target  detection  stage  and  the 
tracking  stage.  In  Section  III,  we  provide  simulation  results  of 
the  algorithm  for  two  smart  targets.  The  method  is  compared 
to  random  and  myopic  strategies  and  shown  to  provide  good 
performance. Finally, in Section IV we conclude with some
summarizing  remarks. 

II.  Smart  Target  Detection  and  Tracking 

In this section, we describe the details of our two stage detection
and tracking algorithm. We first review reinforcement
learning  and  then  show  its  application  to  each  stage. 

A.  Reinforcement  Learning  for  Optimal  Solution  of  a  MDP 

The  problem  of  detecting  and  tracking  smart  targets  can 
be  formulated  as  an  infinite-horizon  Markov  Decision  Process 
(MDP)  [13].  It  is  well  known  that  the  complexity  of  finding 
optimal  policies  for  MDP  grows  exponentially  with  the  state 
and  action  spaces  [2].  Since  the  sensor  scheduling  problem  is 
characterized  by  extremely  large  state  and  action  spaces,  it  is 
necessary  to  develop  approximate  solutions  using  dimension 
reduction.  We  advocate  methods  from  reinforcement  learning 
coupled  with  function  approximation  to  find  approximately 
optimal  policies  for  the  two  stages. 

1) Infinite-Horizon MDP: A discounted-reward infinite-horizon
MDP is defined by a sequence of states $\{S_t\}_{t \ge 0}$
taking values in a state space $S$, a sequence of actions
$\{A_t\}_{t \ge 0}$ taking values in an action space $A$, and a
(possibly random) reward function $r(S_t, A_t)$ that assigns the
cost incurred (when negative) or the reward gained (when positive)
to the event of being at state $S_t$ and taking action $A_t$.
In our context, the state
space  characterizes  the  battlefield.  It  contains  rich  information 
such  as  the  number  of  targets  present,  their  location,  their  type, 
and  whether  they  are  stationary  or  moving.  The  action  space 
contains  all  the  possible  actions.  Each  action  specifies  which 
sensors  to  use,  their  mode  of  operation,  and  where  to  point 
them.  The  reward  system  reflects  the  tradeoffs  between  costs 
of deploying a certain sensor and the gain earned from the
measurement  it  collects. 

The process is initiated with state $S_0$ followed by action
$A_0$ chosen by the controller and continues with the sequence
$S_1, A_1, S_2, A_2, \ldots$. Under the Markovian model, given
$S_t$ and $A_t$, $S_{t+1}$ is independent of all past states and
actions. The state transitions are governed by a stationary
probabilistic law, denoted by $p(S_{t+1}|S_t, A_t)$, that specifies
the distribution of $S_{t+1}$ over $S$, given $S_t$ and $A_t$.
$p(S_{t+1}|S_t, A_t)$ is either a probability density function when
the state space is continuous or a probability mass function when
it is discrete.

A stationary policy $\Pi$ is a map from $S$ to $A$ that specifies
the action taken at each state. Denote the class of all policies
by $\mathcal{P}$. The value function associated with policy $\Pi$,
denoted by $V^{\Pi}(s)$, is the expected total discounted reward
when being in state $S_t = s$ and following policy $\Pi$, that is

$$V^{\Pi}(s) = E\left\{ \sum_{\tau=t}^{\infty} \beta^{\tau-t}\, r(S_\tau, \Pi(S_\tau)) \,\middle|\, S_t = s \right\} \quad \forall s \in S \tag{1}$$

where $\beta \in (0,1)$ is a discount factor, which is included
to value future rewards less than immediate rewards. This
expectation is taken with respect to the joint distribution of
all the targets, which, in the context of smart targets, is
highly dependent on the action sequence. Therefore, a direct
calculation of this expression is computationally intractable.
An optimal policy is a policy that satisfies

$$\Pi^{*}(s) = \arg\max_{\Pi \in \mathcal{P}} V^{\Pi}(s) \quad \forall s \in S . \tag{2}$$

It is well known that the optimal policy is the unique solution
to Bellman’s equation. Unfortunately, when the state and
action  spaces  are  large  and  the  state  transition  density  is  either 
computationally  complicated  or  not  explicitly  available,  this 
methodology  is  intractable  and  one  must  resort  to  approximate 
solutions  such  as  Q-learning  [2]. 
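
As a small illustration of what solving Bellman's equation involves when the state and action spaces are tiny, the following sketch runs value iteration on a hypothetical two-state, two-action MDP. The transition probabilities, the rewards, and the names `value_iteration`, `P` and `r` are assumptions made for this example; they are not taken from the simulations reported in this paper.

```python
import numpy as np

def value_iteration(P, r, beta, tol=1e-10, max_iter=10_000):
    """Solve Bellman's equation for a finite MDP by fixed-point iteration.

    P[a, s, s2] = p(s2 | s, a), r[s, a] = expected reward, beta = discount.
    Returns the optimal value function and a greedy (optimal) policy.
    """
    n_states, _ = r.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = r(s, a) + beta * sum_s2 p(s2 | s, a) V(s2)
        Q = r + beta * np.einsum("asn,n->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = r + beta * np.einsum("asn,n->sa", P, V)
    return V, Q.argmax(axis=1)

# Hypothetical toy problem: states 0 = visible, 1 = hidden;
# actions 0 = active look, 1 = passive look.
P = np.array([
    [[0.1, 0.9], [0.0, 1.0]],   # active: the target tends to hide
    [[1.0, 0.0], [0.3, 0.7]],   # passive: the target tends to reappear
])
r = np.array([[1.0, 0.4],       # rewards when the target is visible
              [0.0, 0.0]])      # no reward while it is hidden
V_opt, policy = value_iteration(P, r, beta=0.9)
```

With these assumed numbers the greedy policy uses the passive action in both states, even though the active action has the larger immediate reward. The table `P` has one entry per (action, state, next state) triple, which is what becomes infeasible for the large state and action spaces of the scheduling problem.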

2)  Q-Learning:  The  optimal  scheduling  policy  for  the 
two  stages  is  found  using  Q-learning  coupled  with  function 
approximation  [17],  [15],  [16].  The  learning  part  relaxes  the 
requirement  for  an  explicit  knowledge  of  the  transition  density, 
and  function  approximation  is  used  to  further  reduce  the 
dimensionality  of  the  state  and  action  spaces. 

Given the optimal value function $V^{*}$, the Q-function is

$$Q(s,a) = E\{ r(s,a) + \beta V^{*}(S_{t+1}) \mid S_t = s, A_t = a \} , \tag{3}$$

i.e., the expected reward when taking action $a$ at state $s$ and
then acting optimally. The Q-function satisfies the equation

$$Q(s,a) = E\left\{ r(s,a) + \beta \max_{a' \in A} Q(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right\} . \tag{4}$$

Given the Q-function, optimal actions are computed as

$$\arg\max_{a \in A} Q(S_t, a) . \tag{5}$$


In Q-learning the Q-function is estimated from multiple
trajectories of the process. Assume first that both $S$ and $A$
are finite. Then, there exists a lookup table representation of
$Q(s,a)$. In this case, given an arbitrary initial value of
$Q(s,a)$, the one-step Q-learning algorithm ([15], p. 148) is
given by the repeated application of the update equation

$$Q(s,a) \leftarrow (1-\gamma)\, Q(s,a) + \gamma \left( r + \beta \max_{a' \in A} Q(s', a') \right) , \tag{6}$$

where each of the 4-tuples $\{S_t = s, A_t = a, S_{t+1} = s', R_t = r\}$
is generated during the progress of the MDP, and the step size
$\gamma \in (0,1)$ decreases with $t$. In most realistic problems
(the problems discussed herein included) it is infeasible to
represent the Q-function in a lookup table, either because the
number of states is too large or simply because the state space is
continuous. Therefore, we require a function approximation
technique to represent the Q-function. The standard and simplest
class of Q-function approximators are linear combinations of basis
functions (also called features), i.e. $Q(s,a) = \theta^{\top} \phi(s,a)$,
where $\phi(s,a)$ is a real-valued feature vector on $S \times A$
associated with state $s$ and action $a$, and the coefficient
vector $\theta$ is to be estimated. Gradient descent is used with
the training data to update the estimate of $\theta$, i.e.

$$\theta \leftarrow \theta + \gamma \left( r + \beta \max_{a'} Q(s', a') - Q(s,a) \right) \nabla_{\theta} Q(s,a)
 = \theta + \gamma \left( r + \beta \max_{a'} \theta^{\top} \phi(s', a') - \theta^{\top} \phi(s,a) \right) \phi(s,a) .$$

Once the learning of the vector $\theta$ is completed, optimal
actions can be computed according to
$\arg\max_{a \in A} \theta^{\top} \phi(s,a)$.

B. Detection of Smart Targets using Reinforcement Learning

The target detection stage is formulated as a Bayesian
hypothesis testing problem in which one is trying to decide
between $M \ge 2$ hypotheses $H_1, \ldots, H_M$. The observed
system is modelled as a MDP with a finite state space $S$
with cardinality $N$. Each hypothesis corresponds to a different
subset of the states and it is assumed that there are no
transitions between states that are associated with different
hypotheses.

At each time $t$, one of $K$ modes denoted by $S_1, \ldots, S_K$
is used to collect a measurement $z_t$, or alternatively a final
decision is made. The possible actions available at each time
are $A = \{S_1, \ldots, S_K, D\}$, where $D$ stands for the action
of making the final decision. After action $D$ the detection
process ends and a reward is granted for a correct decision.

Denote by $f_k(z|s)$ the conditional density of a measurement
collected by mode $k$ given the system is at state $s$. The state
transition probabilities of the Markov process $p(S_{t+1}|S_t, A_t)$
depend on the deployed sensor mode. The possible states in
$S$ are enumerated from 1 to $N$ and the transition probabilities
are summarized in the matrices $A_k$, $k = 1, \ldots, K$, where
$[A_k]_{n,l} = p(S_{t+1} = n \mid S_t = l, S_k)$, $n, l = 1, \ldots, N$,
is the probability that the system moves from state $l$ to state
$n$ when sensor mode $k$ is used.


The dependency on the deployed sensor mode is applicable
when a target can sense that it is being observed and may
react accordingly, e.g. hide or unfold its radar antenna. Since
the number of states is finite and known, we can use the vector
notation $p_t$ to denote the posterior probability vector of the
target states given $Z_t$. Using this notation, when sensor $k$ was
deployed and collected measurement $z_{t+1}$, the time update is

$$p_{t+1} = \frac{A_k\, \mathrm{diag}([f_k(z_{t+1}|1), \ldots, f_k(z_{t+1}|N)])\, p_t}{\mathrm{sum}\big(A_k\, \mathrm{diag}([f_k(z_{t+1}|1), \ldots, f_k(z_{t+1}|N)])\, p_t\big)} , \tag{7}$$

where $f_k(z_{t+1}|n)$ denotes the conditional density of a
measurement that was collected by sensor $k$ given that the system
is in state $n$, and for any vector $v$, $\mathrm{diag}(v)$ is a
diagonal matrix with the elements of $v$ on its diagonal, and
$\mathrm{sum}(v)$ is the sum of its elements. Therefore, a policy
can be defined as a map from the simplex of $N$-dimensional
probability vectors to $A$. The expected total reward at
information state $p_t$ becomes


$$V^{\Pi}(p_t) = E\left\{ \sum_{\tau=t}^{\infty} \beta^{\tau-t}\, r(p_\tau, \Pi(p_\tau)) \right\} , \tag{8}$$

with optimal policy $\Pi^{*} = \arg\max_{\Pi \in \mathcal{P}} V^{\Pi}(p)$.
The Q-function is defined over the $N$-dimensional simplex and
for any action $a \in A$ by

$$Q(p_t, a) = E\{ r(p_t, a) + \beta V^{*}(p_{t+1}) \} , \tag{9}$$

which is the expected reward when taking action $a$ at
information state $p_t$ and then acting optimally. As described
earlier, the dimensionality of the information state space is
reduced by a linear parametrization, and Q-learning is used to
approximate the Q-function. Given $Q$, one finds the optimal
policy by taking the action that maximizes it at any given
information state.
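
The information-state recursion used in this subsection can be sketched in a few lines. The code below is a minimal illustration, assuming a column-stochastic matrix `A_k` (entry `[n, l]` is the probability of moving from state `l` to state `n`) and Gaussian measurement densities. The three-state matrix, the means, the variance, and the function names are hypothetical values chosen for the example, not the parameters used in our simulations.

```python
import numpy as np

def gaussian_pdf(z, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at z (vectorized over mean)."""
    return np.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def update_information_state(p, A_k, likelihoods):
    """One step of p_{t+1} = A_k diag(f_k(z|.)) p_t / sum(...)."""
    unnormalized = A_k @ (likelihoods * p)   # diag(f) p is an elementwise product
    total = unnormalized.sum()
    if total <= 0.0:
        raise ValueError("measurement has zero likelihood under every state")
    return unnormalized / total

# Hypothetical active mode: state 0 = no target, 1 = exposed, 2 = camouflaged.
# Columns sum to one, and the no-target state never mixes with the others,
# so there are no transitions between hypotheses.
A_active = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.4, 0.1],
                     [0.0, 0.6, 0.9]])
means = np.array([0.0, 2.0, 1.0])            # assumed measurement means per state
p0 = np.array([0.5, 0.5, 0.0])               # prior information state
p1 = update_information_state(p0, A_active, gaussian_pdf(1.8, means, 1.0))
```

A policy in this formulation is then any function that maps a vector such as `p1` to one of the sensing modes or to the decision action.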

C.  Tracking  of  Smart  Targets  using  Reinforcement  Learning 

Tip-offs from the detection algorithm are used to initialize
a tracking algorithm which finely geolocates and tracks moving
targets. Targets are tracked by recursively estimating a
conditional probability density known as the Joint Multitarget
Probability Density (JMPD) [7], [8]. In this paper, we restrict
ourselves to the case where the number of targets is known and
fixed and the state of each individual target is a scalar.
More general implementations are given in [8].

1)  The  JMPD  and  Particle  Filter  Approximation:  In  the 
tracking  stage,  the  state  s  of  the  system  (see  Section  II-A) 
is  given  by  the  joint  multitarget  probability  density.  In  this 
subsection,  we  show  how  the  state  is  derived  and  how  states 
are  combined  with  measurements  to  determine  the  next  state. 

We define the joint multitarget conditional probability density
$p(x_t^1, x_t^2, \ldots, x_t^{T-1}, x_t^T, T_t \mid Z_t)$ as the
probability for $T$ targets with states
$x_t^1, x_t^2, \ldots, x_t^{T-1}, x_t^T$ at time $t$ based on a set of
observations $Z_t$. As before, $Z_t$ refers to the collection of
measurements up to and including time $t$, i.e.
$Z_t = \{z_1, z_2, \ldots, z_t\}$, where each of the $z_i$ may be a
single measurement or a vector of measurements made at time $i$.
Each of the state vectors $x^i$ in the JMPD is a vector quantity
and may (for example) be of the form $[x, \dot{x}, y, \dot{y}]^{\top}$.
For convenience, the density will be written more compactly as
$p(X_t, T_t \mid Z_t)$.

The sample space of $X$ is very large. It contains all possible
configurations of state vectors $x^i$. We find that a particle
filter based representation of the JMPD allows tractable
implementation [8]. The particle filter approximation represents
the JMPD by a collection of weighted samples, i.e.

$$p(X, T \mid Z) \approx \sum_{p=1}^{N_{\mathrm{part}}} w_p\, \delta(X - X_p) , \tag{10}$$

where $N_{\mathrm{part}}$ is the number of particles.

2) Information Based Myopic Sensor Management: We use
the  JMPD  to  make  tasking  decisions.  A  good  measure  of  the 
quality  of  an  action  is  the  reduction  in  entropy  expected  to  be 
induced.  Therefore,  the  reward  (see  Section  II-A)  will  be  given 
by  the  information  gained.  To  schedule  a  sensor,  we  enumerate 
all  possible  sensing  actions  and  calculate  the  expected  gain  in 
information  associated  with  each  possible  action. 

The calculation of information gain between two densities
$f_1$ and $f_0$ is done using the Rényi information divergence [14],
[5], also known as the $\alpha$-divergence:

$$D_\alpha(f_1 \| f_0) = \frac{1}{\alpha - 1} \ln \int f_1^{\alpha}(x)\, f_0^{1-\alpha}(x)\, dx . \tag{11}$$

In our application, we are interested in computing the
divergence between the predicted density $p(X_{t+1}|Z_t)$ and the
updated density $p(X_{t+1}|Z_{t+1})$.

We  choose  the  sensing  action  that  makes  the  divergence 
between  the  current  density  and  the  density  after  a  new 
measurement  largest.  Since  we  do  not  know  the  outcome  of  a 
sensing  action  until  after  the  action  is  taken,  we  calculate  the 
expected  divergence  and  use  this  to  schedule  the  sensor.  The 
expected value may be written formally as an integral over all
possible outcomes $z$ when performing sensing action $m$, i.e.

$$\langle D_\alpha \rangle_m = \int dz\, p(z \mid Z_t, m)\, D_\alpha\big( p(\cdot \mid Z_t, z) \,\|\, p(\cdot \mid Z_t) \big) . \tag{12}$$
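
To make the expected-divergence calculation concrete, the sketch below evaluates it for a deliberately simplified setting: a single target located in one of a few discrete cells, and a sensor that returns a binary detect / no-detect outcome for the cell it looks at. It uses the standard discrete form of the Rényi divergence with $0 < \alpha < 1$. The discrete single-target posterior (in place of the particle-based JMPD), the function names, and the numerical values are assumptions for illustration only.

```python
import numpy as np

def renyi_divergence(f1, f0, alpha=0.5):
    """Discrete Renyi alpha-divergence D_alpha(f1 || f0), for 0 < alpha < 1."""
    f1 = np.asarray(f1, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    return np.log(np.sum(f1**alpha * f0**(1.0 - alpha))) / (alpha - 1.0)

def expected_gain(prior, cell, p_d, p_f, alpha=0.5):
    """Expected divergence between posterior and prior for a look at `cell`."""
    prior = np.asarray(prior, dtype=float)
    # Probability of a detection given that the target is in each cell.
    like_detect = np.where(np.arange(prior.size) == cell, p_d, p_f)
    gain = 0.0
    for like in (like_detect, 1.0 - like_detect):   # z = detect, z = no detect
        p_z = float(np.sum(like * prior))            # p(z | Z_t, m)
        if p_z > 0.0:
            posterior = like * prior / p_z           # p(. | Z_t, z)
            gain += p_z * renyi_divergence(posterior, prior, alpha)
    return gain

prior = np.array([0.1, 0.6, 0.3])                    # hypothetical p(. | Z_t)
gains = [expected_gain(prior, m, p_d=0.9, p_f=1e-4) for m in range(prior.size)]
best_cell = int(np.argmax(gains))                    # myopic choice of action m
```

The myopic scheduler simply picks `best_cell`; a cell that the prior already rules out contributes no expected gain, because neither outcome would change the posterior.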

3)  Information  Based  Non-myopic  Sensor  Management: 
As  discussed  in  Section  II-A,  in  many  situations  a  non- 
myopic  sensor  management  strategy  provides  sensor  tasking 
decisions  having  better  performance  than  the  myopic  strategy. 
In  particular,  in  the  setting  considered  here  where  targets  are 
“smart”  and  react  to  sensing  actions,  the  regret  of  choosing  a 
poor  action,  e.g.  active  sensing,  is  long  lasting  as  the  effect  of 
an action persists over time. Therefore, a non-myopic strategy
can be expected to substantially outperform a myopic strategy
in this setting.

We use Q-learning with linear function approximation to
learn a policy which behaves non-myopically. The training
process involves generation of {state, action, next state,
immediate reward} 4-tuples over a large number of training
episodes. In the training process, the immediate reward of an
action is computed using the actual gain in information as
measured by the Rényi divergence.


III.  Simulation  Results 


We  consider  a  model  problem  in  which  an  airborne  platform 
is  trying  to  detect  and  track  a  set  of  ground  targets.  The 
airborne  platform  has  available  a  multimode  sensor  that  is 
able  to  use  an  active  mode  (e.g.  radar)  or  a  passive  mode 
(e.g.  EO/IR).  The  sensor  is  able  to  quickly  steer  an  antenna 
so  as  to  focus  attention  on  specific  regions  of  the  surveillance 
area.  This  is  a  simple  model  of  a  real  platform  like  the  USAF 
JSTARS,  which  has  a  24ft  antenna  installed  on  the  underside 
of  the  aircraft,  is  able  to  scan  electronically  in  azimuth  and  is 
able  to  choose  between  several  modes  of  operation  including 
moving  target  indicator  and  synthetic  aperture  radar. 

In  this  simulation,  targets  are  characterized  by  their  position 
in  one  dimension.  Targets  are  “smart”  in  that  they  sense  when 
they  are  under  surveillance  by  an  active  sensor  and  react  to 
make  future  surveillance  activities  more  difficult.  The  number 
and location of the targets are unknown initially and our task is
to  detect  and  track  the  targets.  The  model  problem  considered 
here  is  summarized  in  Figure  1. 

Fig. 1. An illustration of the model problem. The surveillance region is
broken into detection regions. The detection algorithm schedules the sensor
to most quickly determine the presence or absence of targets in each detection
region. Upon detecting targets, the tracking algorithm is tipped off with the
regions in which targets exist. The tracking algorithm then determines sensor
resource allocations that allow refinement of the initial location and tracking
as the targets move through the surveillance area.

A.  Target  Detection 

Each detection region is modelled as taking one of three
states: state 1, no target present; state 2, an exposed target is
present; and state 3, a camouflaged target is present. There are
two hypotheses: $H_1$ (no target present) and $H_2$ (a target is
present, exposed or camouflaged). The target can move from state 2
to state 3 if it senses that it is being observed. However, it has
a tendency to return from state 3 to state 2 if it no longer senses
that it is being observed, e.g. it may be less effective in state 3.

Intelligence sources provide a prior on the initial state of
the target, which constitutes the initial information state of
the process $p_0$. The platform has one of three sensor modes
to deploy. Sensor mode $i$, deployed at time $t$, provides an
independent measurement $z_i(t)$. For the simulation considered
here, measurements are assumed conditionally Gaussian.

Modes 1 and 3 represent active modes, which can be sensed
by the target, and sensor mode 2 represents a passive mode
which cannot be detected by the target. When the target is in
hide mode, it has an incentive to return to the exposed state.
Sensor mode 3 is less favorable than sensor mode 1 regardless
of the system state. It provides less information on the target,
and when it is used there is a higher probability that the target
will detect it. It was included in this study in order to show
that the optimal learned policy will indeed never use it.

Fig. 2. An improvement over the random allocation policy.

Q-learning (Section II-A.2) was used to approximate the
optimal policy. The basis functions (or features) were chosen
to be indicator functions of disjoint regions of the product of
the probability simplex and the action space $A$, corresponding
to a quantization of the simplex into 55 disjoint regions for
each action in $A$.
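
The learning procedure with indicator features can be sketched as follows. This is a minimal illustration on a hypothetical two-state toy problem, not the detection problem itself: each (state, action) pair plays the role of one disjoint region, whereas in the detection problem the regions would be quantization cells of the simplex. The dynamics, rewards, step-size schedule, and all names are assumptions made for the example.

```python
import numpy as np

N_STATES, N_ACTIONS = 2, 2      # states: 0 visible, 1 hidden; actions: 0 active, 1 passive
BETA = 0.9                      # discount factor

def step(state, action):
    """Hypothetical deterministic toy dynamics and rewards."""
    reward = (1.0, 0.7)[action] if state == 0 else 0.0
    next_state = 1 if action == 0 else 0   # active look hides the target
    return next_state, reward

def phi(state, action):
    """Indicator feature vector: one disjoint region per (state, action)."""
    v = np.zeros(N_STATES * N_ACTIONS)
    v[state * N_ACTIONS + action] = 1.0
    return v

def q_value(theta, state, action):
    return float(theta @ phi(state, action))

def train(n_steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(N_STATES * N_ACTIONS)
    state = 0
    for t in range(n_steps):
        action = int(rng.integers(N_ACTIONS))          # uniform exploration
        next_state, reward = step(state, action)
        gamma = 0.5 / (1.0 + t / 5000.0)               # step size decreasing with t
        target = reward + BETA * max(
            q_value(theta, next_state, a) for a in range(N_ACTIONS))
        td_error = target - q_value(theta, state, action)
        theta += gamma * td_error * phi(state, action)  # gradient of theta^T phi is phi
        state = next_state
    return theta

theta = train()
greedy = [int(np.argmax([q_value(theta, s, a) for a in range(N_ACTIONS)]))
          for s in range(N_STATES)]
```

In this toy problem the learned greedy policy prefers the passive action in both states, although the active action has the larger immediate reward when the target is visible. With indicator features the linear parametrization reduces to a lookup table over regions, so the coarseness of the quantization controls the trade-off between accuracy and the number of coefficients to learn.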

The myopic strategy for this problem is to make an immediate
decision based on the prior without taking any measurements.
Therefore, the estimated optimal policy was instead compared to
a randomized policy in which actions are chosen uniformly.
The improvement in terms of the difference in averaged
value, estimated from 2000 Monte Carlo simulations at each
information state, is presented in Fig. 2.

B.  Target  Tracking 

We  assume  the  target  detection  algorithm  has  detected 
targets  in  Regions  1  and  3  and  passed  this  information  to 
the  target  tracking  algorithm.  At  each  time  step,  the  sensor 
is  able  to  measure  a  single  cell  to  determine  the  presence  or 
absence  of  targets.  The  sensor  can  use  the  active  (mode  1)  or 
passive  (mode  2)  modes  described  above.  Sensor  modes  are 
characterized by a detection probability $P_d$ and a false alarm
probability $P_f$. These probabilities are linked together via SNR
by $P_d = P_f^{1/(1+\mathrm{SNR})}$. This model of sensor returns
corresponds to thresholding of Rayleigh distributed energy from
targets in Rayleigh distributed background noise as is seen
on GMTI radar systems. Note that the sensor characteristics
are defined differently than in the detection portion of the
algorithm. Unlike the detection regions considered earlier, a
sensor cell is now a small area and targets can easily move
between cells, necessitating the fine grained model.

When the target is in visible mode, the active mode works
with high detection probability and low false alarm probability:
$P_d = 0.9$ and $P_f = 10^{-4}$ (corresponding to SNR = 20 dB). The
passive sensor mode works with detection probability $P_d = 0.5$
and false alarm probability $P_f = 10^{-4}$ (SNR = 10 dB).
When in hide mode, both modes are severely degraded and
correspond to a target with SNR = 0 dB.
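
The link between detection probability, false alarm probability and SNR can be checked numerically. The short sketch below evaluates $P_d = P_f^{1/(1+\mathrm{SNR})}$ for the three SNR levels mentioned in the text; the function name and the conversion from dB to linear SNR are the only assumptions.

```python
def detection_probability(p_false_alarm, snr_db):
    """P_d = P_f ** (1 / (1 + SNR)), with the SNR converted from dB to linear."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return p_false_alarm ** (1.0 / (1.0 + snr_linear))

pd_active_visible = detection_probability(1e-4, 20.0)   # visible target, active mode
pd_passive_visible = detection_probability(1e-4, 10.0)  # visible target, passive mode
pd_hidden = detection_probability(1e-4, 0.0)            # hidden target, either mode
```

With $P_f = 10^{-4}$ the formula gives roughly 0.91 at 20 dB, 0.43 at 10 dB, and 0.01 at 0 dB, so the values .9 and .5 quoted above are rounded.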

Targets  can  sense  when  the  active  mode  is  used  and  move 
into  hide  mode  to  prevent  further  interrogation.  Additionally, 
targets  that  have  moved  into  hide  mode  tend  to  move  back 
into  visible  mode  when  the  passive  sensor  mode  is  used.  The 
parameters of interest can be summarized by the following
transition probabilities for each of the two sensor modes:

$$\begin{bmatrix} \Pr(\text{visible to visible}) & \Pr(\text{visible to hide}) \\ \Pr(\text{hide to visible}) & \Pr(\text{hide to hide}) \end{bmatrix}$$

A  myopic  strategy  makes  tasking  decisions  based  only  on 
the  expected  immediate  reward.  Here  the  myopic  strategy  will 
advocate  using  the  active  mode  at  all  times.  Depending  on  the 
transition  probabilities,  this  may  immediately  force  the  targets 
into  hide  mode,  making  them  difficult  to  observe  in  future 
time  steps.  A  non-myopic  strategy,  on  the  other  hand,  will  take 
into  account  the  effect  of  current  actions  on  future  information 
gaining  ability  and  be  more  prudent  in  using  the  active  mode. 

In the simulation, we use

Transition matrix, active sensor mode, first row (from visible): $[\, 0 \;\; 1 \,]$

Transition matrix, passive sensor mode: $\begin{bmatrix} 1 & 0 \\ .2 & .8 \end{bmatrix}$

which indicates that the target always moves into hide when
the active mode is used, and moves from hide to visible with
probability .2 when the passive mode is used.
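
The effect of these transition probabilities on a fixed sensing strategy can be illustrated with a short calculation. The sketch below propagates the visible / hide distribution of a single target and accumulates the expected number of detections over a few time steps for an always-active and an always-passive strategy. Several ingredients are assumptions rather than values stated in the text: the hide row of the active matrix (taken as "stay hidden"), the .2 return probability being applied under the passive mode, the hidden-target detection probability of 0.01 (from the 0 dB SNR), and the convention that the measurement is taken before the target reacts.

```python
import numpy as np

# Rows are the current state (visible, hide); columns are the next state.
ACTIVE = np.array([[0.0, 1.0],     # visible target always hides after an active look
                   [0.0, 1.0]])    # ASSUMPTION: a hidden target stays hidden
PASSIVE = np.array([[1.0, 0.0],    # passive looks are not sensed by the target
                    [0.2, 0.8]])   # ASSUMPTION: return to visible with probability .2
TRANSITIONS = {"active": ACTIVE, "passive": PASSIVE}
# Detection probability per state (visible, hide); the hide value is assumed.
P_DETECT = {"active": np.array([0.9, 0.01]), "passive": np.array([0.5, 0.01])}

def expected_detections(mode, n_steps, start=(1.0, 0.0)):
    """Expected number of detections when the same mode is used at every step."""
    dist = np.array(start, dtype=float)
    total = 0.0
    for _ in range(n_steps):
        total += float(dist @ P_DETECT[mode])   # measure first ...
        dist = dist @ TRANSITIONS[mode]         # ... then the target reacts
    return total

always_active = expected_detections("active", 10)
always_passive = expected_detections("passive", 10)
```

Under these assumptions the always-active strategy collects one good look and then almost nothing, while the always-passive strategy keeps a moderate detection rate. This is the behavior that a scheduler weighing the future effect of its actions would be expected to exploit.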

We trained a Q-function as discussed in Section II. In
Figure 3, we present results of target localization using the
resulting Q-learning strategy. We compare this performance to
(a) a random strategy, (b) a myopic strategy, (c) a random
strategy that only uses the passive mode, and (d) a myopic
strategy that only uses the passive mode. The Q-learning
strategy performs as well as or better than the best of
the four competing strategies in both cases.

IV.  Conclusion 

In  this  paper,  we  have  investigated  the  problem  of  sensor 
scheduling  for  detection  and  tracking  of  smart  moving  ground 
targets  from  an  airborne  sensor.  Since  the  targets  of  interest 
are  able  to  detect  and  respond  to  certain  sensing  actions, 
it  is  mandatory  that  the  long  term  ramifications  be  taken 
into  account  when  choosing  current  sensing  actions.  This 
necessity  for  non-myopic  sensor  scheduling  leads  to  a  very 
computationally  challenging  problem. 

We have addressed this numerical challenge with a two
stage approach, where both stages are solved using reinforcement
learning. The surveillance area is first partitioned into a
set of detection regions and a detection algorithm determines
the presence or absence of a target in each region. Upon
detection, a tracking algorithm is used to finely geolocate and
track targets as they move through the region.

Fig. 3. Target tracking performance. Included are a random strategy, a myopic
strategy, a random strategy that uses only the passive mode, a myopic strategy
that uses only the passive mode, and the Q-learning strategy.